The growing use of the web has promoted a simple and quick method of electronic communication, the best-known example of which is e-mail. Nowadays [...] to hamper the development of the Internet, e.g., advertisements and many more. Spammers have introduced a new technique of embedding the spam message in an image attached to the mail. In this paper, we propose a method based on a combination of SVM and KNN. SVMs tend to take a long time to train on a large data set. If redundant patterns are identified and deleted in pre-processing, the training time can be reduced significantly. We propose a k-nearest neighbor (k-NN) based pattern selection method. The method tries to select the patterns that are near the decision boundary and that are correctly labeled. The basic idea is to find the close neighbors of a query sample and train a local SVM that preserves the distance function on the collection of neighbors. Our experimental studies, based on a publicly available dataset (Dredze), show that the results improve to approximately 98%.

Keywords: KNN, Spam filtering techniques, Spam image, SVM


1. INTRODUCTION

Email is a widespread technology nowadays because of its speed and low cost. Email spam, defined as unsolicited bulk email, is a major problem for Internet networks [1], [2], [3]. With the proliferation of malicious software, spammers have been able to launch large and widespread campaigns that cause economic losses and increase traffic. Recent studies revealed that spam traffic constitutes over 89% of internet activity. Recently, spammers have adopted a new style of spam, the spam image, to make the analysis of the message body text ineffective. A spam image is an attempt by spammers to conceal their message from anti-spam filters: the message is sent in an attached image that is readable by humans but hidden from a text-based filter, and is therefore more difficult to detect. The cost of managing spam is greater than the cost of transmitting it. This cost is due to wasted network resources, increased traffic, significant economic losses, and a decrease in employee productivity [1]. After spammers began embedding their message in images, text-based filters became ineffective at detecting them, which led to the need for image-based filters.

The main issue in spam image filtering is to create an efficient algorithm that separates spam images from the other, legitimate images found in email. Many techniques have been proposed for filtering this type of image, and they fall into three main groups [4], [5]: header-based strategies, which rely on the many fields of the e-mail header that provide useful information [4]; OCR-based techniques, which use an OCR tool to extract the text embedded in the image [4], [5], [6]; and content-based strategies, in which image content and features such as color, edges, and texture are analyzed to separate spam images from normal images [4], [5], [6]. In this paper, we propose a filtering method based on the gray-level co-occurrence matrix (GLCM) to extract image texture features. Images are classified as spam or ham using a Support Vector Machine (SVM), k-nearest neighbor (KNN), and a combination of the two techniques (SVM and KNN). Figure 1 shows samples of spam images.


Figure 1. Examples of image spam


The rest of the paper is organized as follows. Section 2 gives a brief review of related work. Section 3 describes the proposed system. Section 4 presents the performance evaluation metrics. Section 5 presents the results. Finally, Section 6 concludes the paper.


2. LITERATURE SURVEY 

Image spam detection has been discussed extensively in previous work. This section provides an overview of relevant research in image spam classification. In 2017, Rui Chan proposed a system that includes three-layer spam filtering, in which spam is filtered by analyzing both the header and the image. The description of the model explains the design idea and the technologies related to the model in detail. Experimental results show that this system has a satisfactory filtering effect [7].

In 2015, Monireh sadat Hosseinia et al. suggested a method for spam image filtering in which image texture features are used to classify spam images. The gray-level co-occurrence matrix is applied to each image, yielding 22 features, and then k-nearest neighbor and naive Bayesian classifiers are evaluated on images from two databases, Dredze and Image Spam Hunter [4]. In 2015, T. Kumaresan et al. suggested a scheme that extracts low-level features (such as metadata and histogram features of images). An SVM classifier with a kernel function is used to identify spam images based on the extracted features. The accuracy of this method is 90%, but its time complexity remains a problem [8].

In 2014, Jianyi Wang et al. proposed an approach that combines the characteristics of spam images with corner point density for detection. The general idea of the algorithm is to use the proportion of corner points in an image to judge whether it is spam [9]. In 2015, Nisha D. Chopra et al. used two methods to classify spam images: the first uses an OCR tool to separate text from the image, and the second uses a Bayesian algorithm to decide whether the words in the mail are spam [10]. In 2014, Meghali Das et al. proposed a method based on analyzing images that contain only a text region and then classifying the embedded image as spam or legitimate accordingly; they tested their method on the Dredze dataset [11].


3. RESEARCH METHOD

In this section, we discuss the main steps of our proposed system. The goal of our work is to create a system that is able to distinguish between ham images and spam images based on texture and content characteristics. The procedure for extracting features from an image attached to an email is shown in Figure 2. This procedure consists of the following stages:


[Figure 2 diagram: Unknown image-based email -> Pre-processing (image format unification; conversion of colored images to a grayscale image) -> Features extraction (GLCM) -> Classification (KNN, SVM, KNN-SVM)]


Figure 2. Proposed system general architecture


3.1. Dataset 

The dataset used in our work is Dredze [12]. It contains e-mail images of different sizes: 3299 spam images and 2021 legitimate (ham) images. A set of images was deleted during the processing phase because these images do not provide enough information: they are either very small (tens of bytes) or empty, and so contain no texture information. This left 3264 spam images and 1783 ham images.


3.2. Pre-processing stage 

The main advantage of the pre-processing stage is that it organizes the data in order to simplify classification. Pre-processing covers all operations applied to a scanned image in order to reduce or eliminate noise and keep only the desired information, so that the next operation (feature extraction) is easy to implement. The pre-processing stage consists of the following operations:


3.2.1. Image format unification in JPEG format 

JPEG is one of the most recognizable and popular raster image formats. The format is the result of the work of the Joint Photographic Experts Group. JPEG was selected because it has proven to be an effective format in the classification process [13].


3.2.2. Convert colored images to a grayscale image 

Converting a color image to grayscale aims to preserve as much information about the original color image as possible, which requires knowledge of how the color image is represented. A pixel color in an image is a combination of three colors: Red, Green, and Blue (RGB). Converting a color image into a grayscale image means converting the RGB values (24 bit) into a grayscale value (8 bit) [14]. When the image is represented in the RGB model, let R, G, and B be the values of the red, green, and blue components, respectively; the gray value can then be obtained using Equation 1.


$$Gray = 0.2989 \cdot R + 0.5870 \cdot G + 0.1140 \cdot B \tag{1}$$
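As a minimal sketch of Equation 1, the conversion can be written in plain Python as below. The function names and the nested-list image representation are illustrative assumptions, not part of the original system. The blue weight is taken as 0.1140, the value commonly paired with these red and green weights, so that the three weights sum to roughly one.

```python
def rgb_to_gray(pixel):
    """Map one (R, G, B) pixel with 8-bit channels to an 8-bit gray value."""
    r, g, b = pixel
    # Assumed weights: 0.2989, 0.5870, 0.1140 (they sum to 0.9999).
    gray = 0.2989 * r + 0.5870 * g + 0.1140 * b
    # Round to the nearest integer and keep the result inside 0..255.
    return min(255, max(0, round(gray)))


def image_to_gray(image):
    """Convert a 2-D list of RGB tuples into a 2-D list of gray values."""
    return [[rgb_to_gray(px) for px in row] for row in image]


# Tiny hypothetical image: white, black, pure red, pure blue.
rgb_image = [
    [(255, 255, 255), (0, 0, 0)],
    [(255, 0, 0), (0, 0, 255)],
]
gray_image = image_to_gray(rgb_image)
print(gray_image)  # [[255, 0], [76, 29]]
```

Green contributes most to the gray value and blue least, which is why pure red and pure blue map to quite different gray levels.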


3.2.3. Resizing images 

In this step, all images in the dataset are resized to the same size to prepare them for feature extraction. In our experiments, we found that resizing the images to 65x65 gave the best results.


3.3. Features extraction 

After the pre-processing stage, feature extraction is applied to the image to extract features and represent them as a feature vector. Many feature extraction methods are used in different applications, and a method may succeed in one application and fail in another. Selecting the feature extraction method is therefore an important step toward achieving a high classification rate; in our experiments, we used the gray-level co-occurrence matrix (GLCM) method.


3.3.1. Gray-level co-occurrence matrix method 

Texture is a characteristic appearance of a surface and a crucial property for describing the various elements of an image. The aim of texture analysis is to find a way to describe the essential properties of the image and present them in a single, simple form that can be used for accurate classification. The GLCM is a two-dimensional matrix g(i, j) that reveals properties of the spatial distribution of the gray levels in the texture image, where element (i, j) is the number of times a pixel with value i occurs at distance d from a pixel with value j. The number of rows and columns in the matrix is equal to the number of gray levels in the original image. In our work, we used three angles (0°, 90° and 135°) between a pixel and its neighbor pixel. The probability for each pair (i, j) is computed according to the following equation.


$$p(i, j) = \frac{g(i, j)}{\sum_{i} \sum_{j} g(i, j)} \tag{2}$$
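The counting and normalization described above can be sketched as follows. This is an illustrative pure-Python version, not the authors' implementation: the function names are hypothetical, the offsets assume the usual row/column convention for the angles 0°, 90° and 135°, and pairs are counted in one direction only (a non-symmetric matrix), which the paper does not specify.

```python
# Pixel offsets (row step, column step) assumed for the three angles.
OFFSETS = {0: (0, 1), 90: (-1, 0), 135: (-1, -1)}


def glcm(image, levels, angle, distance=1):
    """Count co-occurrences g(i, j) of gray levels i and j at the given offset."""
    dr, dc = OFFSETS[angle]
    dr, dc = dr * distance, dc * distance
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            # Skip neighbors that fall outside the image.
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
    return counts


def normalize(counts):
    """Divide every count by the total so that the entries sum to one."""
    total = sum(map(sum, counts))
    return [[v / total for v in row] for row in counts]


# Small hypothetical image with four gray levels (0..3).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
counts = glcm(image, levels=4, angle=0)
probs = normalize(counts)
for row in counts:
    print(row)
```

For this image and angle 0°, twelve horizontal pixel pairs are counted, so each probability is the corresponding count divided by 12.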


From the co-occurrence matrix $g_{d,\theta}$, twelve features can be derived, such as Energy, Entropy, Contrast, Homogeneity, correlation, and others, as shown in Table 1, where g(i, j) denotes the normalized entries given by Equation 2.


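Table 1. Gray-Level Co-occurrence Matrix (GLCM) Features


Feature number    Measure
F1     Energy = $\sum_i \sum_j g(i, j)^2$
F2     Entropy = $-\sum_i \sum_j g(i, j) \log_2 g(i, j)$
F3     Contrast = $\sum_i \sum_j (i - j)^2 g(i, j)$
F4     Homogeneity = $\sum_i \sum_j \frac{g(i, j)}{1 + (i - j)^2}$
F5     Dissimilarity = $\sum_i \sum_j g(i, j) \cdot |i - j|$
F6     Mean i: $\mu_i = \sum_i \sum_j i \cdot g(i, j)$
F7     Mean j: $\mu_j = \sum_i \sum_j j \cdot g(i, j)$
F8     Variance i: $\sigma_i^2 = \sum_i \sum_j g(i, j) \cdot |i - \mu_i|^2$
F9     Variance j: $\sigma_j^2 = \sum_i \sum_j g(i, j) \cdot |j - \mu_j|^2$
F10    Standard deviation i: $\sigma_i = \sqrt{\sigma_i^2}$
F11    Standard deviation j: $\sigma_j = \sqrt{\sigma_j^2}$
F12    Maximum probability = $\max_{i,j} g(i, j)$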
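To make the measures in Table 1 concrete, the sketch below computes twelve features from a normalized co-occurrence matrix. It is an illustration under assumptions: the formulas follow the common textbook definitions of these GLCM measures (base-2 logarithm for entropy, squared difference in the homogeneity denominator), the sixth feature is assumed to be the mean over i, and the function and key names are hypothetical.

```python
import math


def glcm_features(p):
    """Compute twelve texture features from a normalized co-occurrence matrix p."""
    n = len(p)
    cells = [(i, j, p[i][j]) for i in range(n) for j in range(n)]
    mean_i = sum(i * v for i, j, v in cells)
    mean_j = sum(j * v for i, j, v in cells)
    var_i = sum(v * (i - mean_i) ** 2 for i, j, v in cells)
    var_j = sum(v * (j - mean_j) ** 2 for i, j, v in cells)
    return {
        "energy": sum(v * v for _, _, v in cells),
        # Zero entries contribute nothing to the entropy and are skipped.
        "entropy": -sum(v * math.log2(v) for _, _, v in cells if v > 0),
        "contrast": sum(v * (i - j) ** 2 for i, j, v in cells),
        "homogeneity": sum(v / (1 + (i - j) ** 2) for i, j, v in cells),
        "dissimilarity": sum(v * abs(i - j) for i, j, v in cells),
        "mean_i": mean_i,
        "mean_j": mean_j,
        "variance_i": var_i,
        "variance_j": var_j,
        "std_i": math.sqrt(var_i),
        "std_j": math.sqrt(var_j),
        "max_probability": max(v for _, _, v in cells),
    }


# Hypothetical 2-level matrix in which only equal gray levels co-occur.
p = [[0.5, 0.0], [0.0, 0.5]]
features = glcm_features(p)
for name, value in features.items():
    print(f"{name}: {value:.4f}")
```

Computing the features once per angle and concatenating the results would give one feature vector per image; how the three angles are combined is not stated in the text, so that step is left out here.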


3.3.2. Normalization 

Normalization is an important data pre-processing step that prevents attributes in larger numeric ranges from dominating those in smaller numeric ranges. Feature normalization, or feature scaling, is an essential technique for data pre-processing, motivated by the need to roughly equalize the range and weight of the data attributes [15]. There are several ways to normalize, but one of the simplest and most widely used is linear rescaling from the range (Min, Max) to a new range. Assume that:


$$I: \{X \subseteq \mathbb{R}^n\} \to \{Min, .., Max\}$$

Normalization transforms an n-dimensional grayscale image I with intensity values in the range (Min, Max) into a new image

$$I_N: \{X \subseteq \mathbb{R}^n\} \to \{newMin, .., newMax\}$$

with intensity values in the range (newMin, newMax). The linear normalization of a grayscale digital image is performed according to the formula [16]:

$$I_N = (I - Min) \frac{newMax - newMin}{Max - Min} + newMin \tag{3}$$
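A minimal sketch of linear (min-max) normalization, applied here to the columns of a feature table, is given below. The function names, the default target range of 0 to 1, and the handling of constant columns are assumptions made for illustration; the paper does not state which target range it uses.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale a sequence of numbers to the range [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # Constant input: the formula would divide by zero, so map to new_min.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (old_max - old_min)
    return [(v - old_min) * scale + new_min for v in values]


def scale_columns(rows, new_min=0.0, new_max=1.0):
    """Rescale every feature (column) of a list of feature vectors independently."""
    columns = [min_max_scale(col, new_min, new_max) for col in zip(*rows)]
    return [list(row) for row in zip(*columns)]


# Hypothetical feature vectors whose two features live on very different scales.
vectors = [[2.0, 100.0], [4.0, 300.0], [6.0, 200.0]]
scaled = scale_columns(vectors)
print(scaled)
```

After scaling, both features span the same range, so neither dominates a distance computation simply because of its units.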


3.4. Classification

Classifiers are used for different purposes [17]; in this paper, they are used to assign an image to one of two classes, ham or spam. In the training phase, a classifier extracts the features of known objects and saves them as models for the trained classes. In the testing phase, it identifies an unknown object by extracting its features and comparing them with the stored models. In our experiments, we used SVM, KNN, and our combination of SVM and KNN. The combination was chosen for several reasons, such as improving accuracy and reducing time and storage, and is presented in detail in the KNN-SVM section.


3.4.1. SVM 

The support vector machine (SVM) is a powerful technique for data classification. Training involves solving a quadratic programming problem, which requires a long training time and a large amount of memory for large-scale problems [18]. An SVM can be used when the data have exactly two classes. It classifies data by finding the best hyperplane that separates all data points of one class from those of the other class. The best hyperplane for an SVM is the one with the largest margin between the two classes [19], where the margin is the maximal width of the slab parallel to the hyperplane that contains no interior data points [8]. SVMs belong to the family of generalized linear classifiers and can be interpreted as an extension of the perceptron [20]. A special property is that they simultaneously minimize the empirical classification error and maximize the geometric margin; hence they are also known as maximum margin classifiers. Figure 3 shows an SVM classifier.


Figure 3. Support vector machine [8]


3.4.2. KNN 

The k-nearest neighbor algorithm (KNN) is a type of supervised learning that is used in several applications in image classification, data mining, and many other fields. KNN can be computed with several distance metrics; the most common choice is the Euclidean distance [14]. Let $x_i = (x_{i1}, x_{i2}, ..., x_{in})$ and $x_j = (x_{j1}, x_{j2}, ..., x_{jn})$ be two vectors; the distance between them is calculated as follows:

$$D(x_i, x_j) = \sqrt{\sum_{m=1}^{n} (x_{im} - x_{jm})^2} \tag{4}$$

The KNN algorithm is robust and simple to implement. However, one of its main disadvantages is its inefficiency on large-scale, high-dimensional data sets [21]. The main reason for this drawback is its "lazy" learning nature: it has no real learning stage, which results in a high computational cost at classification time.
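The following sketch shows a KNN classifier with the Euclidean distance in a few lines of Python. The function name and the toy data are hypothetical. Note that all the work happens at prediction time, when every training vector is compared with the query, which is the "lazy" behavior described above.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k):
    """Label the query by majority vote among its k nearest training vectors."""
    # Sort training indices by Euclidean distance to the query.
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-dimensional feature vectors for two well-separated classes.
train_x = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = ["ham", "ham", "ham", "spam", "spam", "spam"]

print(knn_predict(train_x, train_y, [0.2, 0.1], k=3))  # ham
print(knn_predict(train_x, train_y, [5.5, 5.2], k=3))  # spam
```

With two classes, an odd k avoids tied votes; the sort over all training vectors is what makes each prediction cost grow with the size of the training set.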


3.4.3. KNN-SVM

SVM performs well but has some drawbacks: training and classification take a long time and consume considerable CPU and memory, especially when the data are high-dimensional. It does, however, work well when training uses only a small amount of data, that is, when the number of training samples is smaller than the number of test samples. KNN classification, in contrast, is simple and low-cost [21]. In our work on classifying spam images in email, we therefore use KNN together with SVM to simplify training, optimize the SVM algorithm, and obtain very efficient results. Figure 4 shows the flowchart of the proposed KNN-SVM combination for classifying email images.
The steps of this technique are:
1. Compute the distances from the query to all training examples and select the k nearest neighbors.
2. If the k neighbors all have the same label, assign that label to the query and exit; otherwise, compute the pairwise distances between the k neighbors.
3. Convert the distance matrix to a kernel matrix and apply a multiclass SVM.
4. Use the resulting classifier to label the query.
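The four steps can be sketched as below for the two-class case. This is not the authors' implementation, and several choices are assumptions made to keep the example dependency-free: labels are encoded as +1 (spam) and -1 (ham); distances are converted to a kernel with the Gaussian form exp(-gamma * d^2), one common conversion that the paper does not confirm; and a small bias-free kernel SVM trained by dual coordinate ascent is swapped in for a full SVM solver. All function names and parameter values are hypothetical.

```python
import math


def kernel(a, b, gamma):
    """Gaussian kernel built from the squared Euclidean distance (assumed form)."""
    return math.exp(-gamma * math.dist(a, b) ** 2)


def train_local_svm(xs, ys, gamma, c=10.0, epochs=200):
    """Fit a bias-free kernel SVM on +1/-1 labels by dual coordinate ascent."""
    n = len(xs)
    gram = [[kernel(xs[i], xs[j], gamma) for j in range(n)] for i in range(n)]
    alpha = [0.0] * n
    for _ in range(epochs):
        for i in range(n):
            margin = ys[i] * sum(alpha[j] * ys[j] * gram[i][j] for j in range(n))
            # Newton step on the dual objective, clipped to the box [0, c].
            alpha[i] = min(c, max(0.0, alpha[i] + (1.0 - margin) / gram[i][i]))
    return alpha


def knn_svm_predict(train_x, train_y, query, k, gamma=0.5):
    """Label the query with +1 or -1 following the four KNN-SVM steps."""
    # Step 1: distances from the query to every training example.
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    xs = [train_x[i] for i in order[:k]]
    ys = [train_y[i] for i in order[:k]]
    # Step 2: unanimous neighbors decide immediately.
    if len(set(ys)) == 1:
        return ys[0]
    # Step 3: kernel matrix over the k neighbors, then a local SVM.
    alpha = train_local_svm(xs, ys, gamma)
    # Step 4: the local decision function labels the query.
    score = sum(a * y * kernel(x, query, gamma) for a, x, y in zip(alpha, xs, ys))
    return 1 if score >= 0 else -1


# Hypothetical data: ham (-1) in the lower left, spam (+1) in the upper right.
train_x = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 2],
           [3, 3], [4, 4], [4, 5], [5, 4], [5, 5]]
train_y = [-1, -1, -1, -1, -1, 1, 1, 1, 1, 1]

print(knn_svm_predict(train_x, train_y, [0.2, 0.3], k=3))  # -1, unanimous neighbors
print(knn_svm_predict(train_x, train_y, [2.9, 2.9], k=4))  # 1, decided by the local SVM
```

Only queries whose neighborhoods contain both classes reach the SVM, and that SVM is trained on just k points, which is where the saving in training time comes from.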


Figure 4. Proposed system architecture of SVM and KNN classifiers 


4. PERFORMANCE EVALUATION METRICS 
The following standard performance metrics are used to evaluate the proposed method: accuracy, precision, recall, and F-measure, which are defined in Table 2 [1], [4].


Table 2. Performance Evaluation Metrics


Measure      Defined as                                       What it means
Accuracy     (TP + TN) / (TP + TN + FP + FN)                  Percentage of predictions that are correct [22].
Precision    TP / (TP + FP)                                   The proportion of spam predictions that are correct [1].
Recall       TP / (TP + FN)                                   The likelihood that true positive examples are retrieved (completeness of the retrieval process) [1].
F-measure    2 x Precision x Recall / (Precision + Recall)    Combines the two measures in a single equation that can be interpreted as a weighted average of precision and recall [1].


Where FP, FN, TP, and TN are defined as follows [1], [4], [22]:

False Positive (FP): the number of ham e-mail messages that are incorrectly classified as spam.
False Negative (FN): the number of spam e-mail messages that are incorrectly classified as ham.
True Positive (TP): the number of spam e-mail messages that are correctly classified.
True Negative (TN): the number of ham e-mail messages that are correctly classified.
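As a worked example of these definitions, the sketch below computes the four metrics from confusion-matrix counts, with spam as the positive class. The function name and the counts are hypothetical and are not results from the paper.

```python
def evaluate(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F-measure from the four counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical counts: 90 spam caught, 20 spam missed, 10 ham flagged, 80 ham passed.
metrics = evaluate(tp=90, fp=10, tn=80, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The sketch assumes at least one predicted and one actual spam message; otherwise precision or recall would divide by zero and would need a convention of its own.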




5. RESULTS AND ANALYSIS 

A GLCM-based feature extraction method for an image spam classification system was built. We then conducted three sets of experiments to verify the effectiveness and efficiency of our approach. The first set measures classification accuracy using SVM as the classifier, the second using KNN, and the third using the combination of KNN-SVM. Finally, we compare the performance of the three approaches.


5.1. Results with applying SVM 

Using the SVM classifier, we obtained an average accuracy of 49.78% when the training data consist of 1100 ham images and 1770 spam images. Table 3 shows the results for different numbers of training samples.


Table 3. Result of SVM with Different Training Samples


Spam image (3264 images)          Ham image (1783 images)           Average
Train   Test   Accuracy (%)       Train   Test   Accuracy (%)       accuracy (%)
50      1494   90.36              50      683    91.51              90.93
100     1494   89.76              100     683    91.95              90.85
150     1494   92.10              150     683    91.80              91.95
200     1494   93.31              200     683    91.95              92.63
1770    1494   0                  1100    683    99.56              49.78


Table 3 shows that the SVM classifier gives acceptable results when the number of training samples is small, whereas the accuracy for spam images drops to 0 when 1770 spam training samples are used. Figure 5 shows the average accuracy of SVM for different numbers of training samples.
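The Average Accuracy column appears to be the unweighted mean of the spam and ham accuracies; this is an assumption, but it matches the reported values. The short sketch below checks it for the last row of Table 3 and, for contrast, computes the accuracy over all test images, which weights each class by its test-set size. The function names are illustrative.

```python
def average_accuracy(spam_acc, ham_acc):
    """Unweighted mean of the two per-class accuracies (in percent)."""
    return (spam_acc + ham_acc) / 2


def overall_accuracy(spam_acc, n_spam, ham_acc, n_ham):
    """Accuracy over all test images, weighting each class by its size."""
    correct = spam_acc / 100 * n_spam + ham_acc / 100 * n_ham
    return 100 * correct / (n_spam + n_ham)


# Last row of Table 3: 1494 spam test images at 0%, 683 ham test images at 99.56%.
avg = average_accuracy(0.0, 99.56)
overall = overall_accuracy(0.0, 1494, 99.56, 683)
print(f"average of class accuracies: {avg:.2f}")
print(f"accuracy over all test images: {overall:.2f}")
```

For this row the unweighted mean is 49.78, while only about 31% of the 2177 test images are classified correctly, because the spam test set is more than twice as large as the ham test set.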
Figure 5. Accuracy for SVM with different number of training samples


5.2. Results with applying KNN

KNN was evaluated for different values of K using 3264 spam images (1770 for training and 1494 for testing) and 1783 ham images (1100 for training and 683 for testing). The results are shown in Table 4 for values of K between 15 and 40. Table 4 shows that the best average accuracy is obtained for K in the range 15-20.


Table 4. Result for KNN with Different Values for K


K value   Accuracy for spam image (%)   Accuracy for ham image (%)   Average accuracy (%)
16        92.97                         95.92                        94.45
17        95.92                         92.09                        94.01
20        96.05                         91.95                        94
25        95.25                         90.63                        92.94
30        95.45                         91.36                        93.41
35        95.18                         89.60                        92.39
40        94.78                         90.92                        92.85


5.3. Combination of KNN- SVM 

The proposed method tries to select the patterns that are located near the decision boundary and are correctly labeled. A pattern near the decision boundary tends to have neighbors with mixed class labels. Thus, the class labels of the K nearest neighbors can be used to identify the K patterns that will be input to SVM. Table 5 shows the accuracy for spam and ham images and their average. The results show that the combination of KNN-SVM gives the best results. Figure 6 shows the performance evaluation metrics for our proposed method, and Figure 7 shows a comparison of the performance metrics accuracy, precision, recall, and F-measure. The accuracy of our proposed texture-based method and of some other methods is reported in Table 6 to demonstrate the efficiency of our proposed system.


Table 5. Result for SVM-KNN with Different Values of K


K value   Accuracy for spam image (%)   Accuracy for ham image (%)   Average accuracy (%)
15        98.80                         95.31                        97.06
16        98.80                         95.61                        97.20
17        98.80                         95.61                        97.20
20        98.93                         95.61                        97.27
25        98.80                         95.61                        97.20
30        98.93                         95.17                        97.05
35        99                            95.17                        97.08
40        98.19                         94.88                        96.54


Table 6. The Accuracy Achieved by Our Proposed Method and Other Methods for Email Image Classification


Related work      Publication year   Techniques used for image spam filtering/classification         Classification accuracy
[7]               2017               Multi-layer algorithm                                            96.2%
[4]               2015               k-nearest neighbor classifier (KNN) and naive Bayesian (NB)      91 for KNN and 75 for NB classifier
[8]               2015               Support Vector Machine and Particle Swarm Optimization           90%
[9]               2014               Thresholding                                                     91.3%
[10]              2015               OCR and Bayesian algorithm                                       Not defined
[11]              2014               Content analysis                                                 Not defined
Proposed method                      Texture-based features using a combination of SVM-KNN            97.20


[Figure 6 plot: "Comparison of Accuracy"; x-axis: various K values (15, 16, 17, 20, 25, 30, 35, 40); y-axis: accuracy]

Figure 6. Performance evaluation metrics for our proposed method

[Figure 7 plot: bars for Accuracy, Precision, Recall, and F-measure]

Figure 7. Comparison of the performance metrics accuracy, precision, recall, and F-measure


6. CONCLUSION

In this paper, we presented a method for distinguishing ham and spam images using GLCM, which is one of the image texture features. For each image, 12 features are extracted in three directions; these features include entropy, energy, mean, etc. We first applied SVM to classify the images as ham or spam. Because SVM requires a long training time and a large amount of memory for large-scale problems [18], we then turned to KNN, which has its own problem in handling high-dimensional data. To improve the SVM performance, a combination of SVM and KNN was applied to obtain the best accuracy. As Table 5 shows, the average accuracy is 97.27% when the value of K is 20.